Reconfigurable processor array exploiting ILP and TLP

ABSTRACT

A processing system according to the invention comprises a plurality of processing elements, and the plurality of processing elements comprises a first set of processing elements and at least a second set of processing elements. Each processing element of the first set comprises a register file and at least one instruction issue slot, and the instruction issue slot comprises at least one functional unit. This type of processing element is dedicated to executing a thread with no or a very low degree of instruction-level parallelism. Each processing element of the second set comprises a register file and a plurality of instruction issue slots, and each instruction issue slot comprises at least one functional unit. This type of processing element is dedicated to executing a thread with a large degree of instruction-level parallelism. Each processing element is arranged to execute its instructions under a common thread of control. The processing system further comprises communication means arranged for communication across the processing elements. In this way the processing system is capable of exploiting thread-level parallelism, instruction-level parallelism, or a combination thereof, in an application.

TECHNICAL FIELD

The technical field of this invention is processor architectures, particularly related to multi-processor systems, methods for programming said processors and compilers for implementing said methods.

BACKGROUND ART

A Very Long Instruction Word (VLIW) processor is capable of executing many operations within one clock cycle. Generally, a compiler reduces program instructions into basic operations that the processor can perform simultaneously. The operations to be performed simultaneously are combined into a very long instruction word (VLIW). The instruction decoder of the VLIW processor decodes and issues the basic operations comprised in a VLIW each to a respective processor data-path element. Alternatively, the VLIW processor has no instruction decoder, and the operations comprised in a VLIW are directly issued each to a respective processor data-path element. Subsequently, these processor data-path elements execute the operations in the VLIW in parallel. This kind of parallelism, also referred to as instruction-level parallelism (ILP), is particularly suitable for applications which involve a large number of identical calculations, as can be found e.g. in media processing. Other applications comprising more control-oriented operations, e.g. for servo control purposes, are not suitable for programming as a VLIW program. However, often these kinds of programs can be reduced to a plurality of program threads that can be executed independently of each other. The execution of such threads in parallel is also denoted as thread-level parallelism (TLP). A VLIW processor is, however, not suitable for executing a program using thread-level parallelism. Exploiting the latter type of parallelism requires that sub-sets of processor data-path elements have an independent control flow, i.e. that they can access their own programs in a sequence independent of each other, e.g. they are capable of independently performing conditional branches. The data-path elements in a VLIW processor, however, all execute a sequence of instructions in the same order. The VLIW processor can, therefore, only execute one thread.

To control the operations in the data pipeline of a VLIW processor, two different mechanisms are commonly used: data-stationary and time-stationary. In the case of data-stationary encoding, every instruction that is part of the processor's instruction-set controls a complete sequence of operations that have to be executed on a specific data item, as it traverses the data pipeline. Once the instruction has been fetched from program memory and decoded, the processor controller hardware will make sure that the composing operations are executed in the correct machine cycle. In the case of time-stationary coding, every instruction that is part of the processor's instruction-set controls a complete set of operations that have to be executed in a single machine cycle. These operations may be applied to several different data items traversing the data pipeline. In this case it is the responsibility of the programmer or compiler to set up and maintain the data pipeline. The resulting pipeline schedule is fully visible in the machine code program. Time-stationary encoding is often used in application-specific processors, since it saves the overhead of hardware necessary for delaying the control information present in the instructions, at the expense of larger code size.

DISCLOSURE OF THE INVENTION

It is an object of the invention to provide a processor that is capable of exploiting both instruction-level parallelism and thread-level parallelism, or a combination thereof, during execution of an application.

For that purpose, a processor according to the invention comprises a plurality of processing elements, the plurality of processing elements comprising a first set of processing elements and at least a second set of processing elements; wherein each processing element of the first set comprises a register file and at least one instruction issue slot, the instruction issue slot comprising at least one functional unit, and the processing element being arranged to execute instructions under a common thread of control; wherein each processing element of the second set comprises a register file and a plurality of instruction issue slots, each instruction issue slot comprising at least one functional unit, and the processing element being arranged to execute instructions under a common thread of control;

and wherein the number of instruction issue slots in the processing elements of the second set is substantially higher than the number of instruction issue slots in the processing elements of the first set;

and wherein the processing system further comprises inter-processor communication means arranged for communicating between processing elements of the plurality of processing elements. The computation means can comprise adders, multipliers, means for performing logical operations, e.g. AND, OR, XOR etc., lookup table operations, memory accesses, etc.

A processor according to the present invention allows exploiting both instruction-level parallelism and thread-level parallelism in an application, and a combination thereof. In case a program has a large degree of instruction-level parallelism, the application can be mapped onto one or more processing elements of the second set of processing elements. These processing elements have multiple issue slots allowing the execution of multiple instructions in parallel under one thread of control, and are therefore suited for exploiting instruction-level parallelism. If a program has a large degree of thread-level parallelism, but a low degree of instruction-level parallelism, the application can be mapped onto the processing elements of the first set of processing elements. These processing elements have a relatively lower number of issue slots allowing the mostly sequential execution of a series of instructions under one thread of control. By mapping each thread on such a processing element, several threads of control can be present in parallel. In case a program has a large degree of thread-level parallelism, and one or more threads have a large degree of instruction-level parallelism, the application can be mapped onto a combination of processing elements of the first set as well as the second set of processing elements. Processing elements of the first set allow execution of threads consisting of a mostly sequential series of instructions, while processing elements of the second set allow execution of threads having instructions that can be executed in parallel. As a result, the processor according to the invention can exploit both instruction-level parallelism and thread-level parallelism, depending on the type of application that has to be executed.

“Architecture and Implementation of a VLIW Supercomputer” by Colwell et al., in Proc. of Supercomputing 1990, pp. 910-919, describes a VLIW processor, which can be configured either as two 14-operations-wide processors, each independently controlled by a respective controller, or as one 28-operations-wide processor controlled by one controller. EP0962856 discloses a Very Long Instruction Word processor that includes plural program counters and is selectively operable in either a first or a second mode. In the first mode, the data processor executes a single instruction stream. In the second mode, the data processor executes two independent program instruction streams simultaneously. Said documents, however, disclose neither the principle of a processor array with a number of processing elements executing threads in parallel, said threads varying from having no instruction-level parallelism to a large degree of instruction-level parallelism, nor how such a processor array could be realized.

An embodiment of the invention is characterized in that the processing elements of the plurality of processing elements are arranged in a network, wherein a processing element of the first set is arranged for direct communication with a processing element of only the second set, via the inter-processor communication means; and wherein a processing element of the second set is arranged for direct communication with a processing element of only the first set, via the inter-processor communication means. In practical applications, functions that have a large degree of instruction-level parallelism and functions having a low degree of instruction-level parallelism will be interleaved. By choosing an architecture in which processing elements of the first type and second type are interleaved as well, an efficient mapping of the application onto the processing system is allowed.

An embodiment of the invention is characterized in that the inter-processor communication means comprise a data-driven synchronized communication means. By using a data-driven synchronization mechanism to govern communication across the processing elements, it can be guaranteed that no data is lost during communication.

An embodiment of the invention is characterized in that the processing elements of the plurality of processing elements are arranged to be bypassed by the inter-processor communication means. An advantage of this embodiment is that it increases the flexibility of mapping the application onto the processing system. Depending on the degree of instruction-level parallelism as well as task-level parallelism of the application, one or more processing elements may not be used during execution of the application.

Further embodiments of the invention are described in the dependent claims. According to the invention a method for programming said processing system, as well as a compiler program product being arranged for implementing all steps of said method for programming a processing system, when said compiler program product is run on a computer system, are claimed as well.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic diagram of a processing system according to the invention.

FIG. 2 shows an example of a processing element of the second set of processing elements in more detail.

FIG. 3 shows an example of a processing element of the first set of processing elements in more detail.

FIG. 4 shows an example of the data-path connection between processing elements in more detail.

FIG. 5 shows the application graph of an application to be executed by a processing system according to the invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 1 schematically shows a processing system according to the invention. The processing system comprises a plurality of processing elements PE1-PE23, having a first set of processing elements PE1-PE15, and a second set of processing elements PE17-PE23. The processing elements can exchange data via data-path connections DPC. In the preferred embodiment shown in FIG. 1, the processing elements are arranged such that between two processing elements of the first set, there is one processing element of the second set, and vice versa, and the data-path connections provide for data exchange between neighboring processing elements. Non-neighboring processing elements may exchange data by transferring it via a chain of mutually neighboring processing elements. Alternatively, or in addition, the processing system may comprise one or more global buses spanning subsets of the plurality of processing elements, or point-to-point connections between any pair of processing elements. Alternatively, the processing system may comprise more or fewer processing elements, or more than two different sets of processing elements, such that processing elements in the different sets comprise different numbers of issue slots, therefore supporting different levels of instruction-level parallelism per set.

FIG. 2 shows an example of a processing element of the second set of processing elements PE17-PE23 in more detail. Each processing element of the second set of processing elements comprises two or more issue slots (ISs), and one or more register files (RFs), each issue slot comprising one or more functional units. The processing element in FIG. 2 comprises five issue slots IS1-IS5, and six functional units: two arithmetic and logic units (ALU), two multiply-accumulate units (MAC), an application-specific unit (ASU), and a load/store unit (LD/ST). The processing element also comprises five register files RF1-RF5. Issue slot IS1 comprises two functional units: an ALU and a MAC. Functional units in a common issue slot share read ports from a register file and write ports to an interconnect network IN. In an alternative embodiment, a second interconnect network could be used in between register file and operation issue slots. The functional unit(s) in an issue slot have access to at least one register file associated with said issue slot. In FIG. 2, there is one register file associated with each issue slot. Alternatively, more than one issue slot could be connected to a single register file. Yet another possibility is that multiple, independent register files are connected to a single issue slot, e.g. one different RF for each separate read port of a functional unit in the issue slot. The data-path connections DPC between different processing elements are preferably driven from the load/store unit (LD/ST) in the respective processing element, so that communications across processing elements can be managed as memory transactions. Preferably, a different load/store unit (LD/ST) is used in association with the different data-path connections (DPC) connecting the processing element to other processing elements. This way, if the processing element is directly connected to e.g. four other processing elements, then four different load/store units are preferably used for communication with those processing elements, not shown in FIG. 2. In addition, further load/store units could be added to the data-path of a processing element, and associated to data memories (e.g. RAM), either local to the processing element, or system-level memories, not shown in FIG. 2. The functional units are controlled by a controller CT that has access to an instruction memory IM. A program counter PC determines the current instruction address in the instruction memory IM. The instruction pointed to by said current address is first loaded into an internal instruction register IR in the controller. The controller CT then controls data-path elements (functional units, register files, interconnect network) to perform the operations specified by the instruction stored in the instruction register IR. To do so, the controller communicates to the functional units via an opcode-bus OB, e.g. providing operation codes to the functional units, to the register files via an address-bus AB, e.g. providing addresses for reading and writing registers in the register file, and to the interconnect network IN through a routing-bus RB, e.g. providing routing information to the interconnect multiplexers. Processing elements of the second set comprise multiple issue slots, which allow exploiting instruction-level parallelism within a thread. For example, application functions with inherent instruction-level parallelism such as Fast Fourier Transforms, Discrete Cosine Transforms and Finite Impulse Response filters can be mapped onto processing elements of the second set.

FIG. 3 shows an example of a processing element of the first set of processing elements PE1-PE15 in more detail. A processing element of the first set of processing elements comprises a relatively low number of issue slots, compared to processing elements of the second set of processing elements. A processing element of the first set further comprises one or more register files and a controller. The issue slots comprise one or more functional units, for example an arithmetic and logic unit, a multiply-accumulate unit or an application-specific unit. The processing element in FIG. 3 comprises two issue slots IS6 and IS7 and two register files RF6 and RF7. Issue slot IS6 comprises two functional units: an ALU and a MAC. Functional units in a common issue slot share read ports from a register file and write ports to an interconnect network IN. Issue slot IS7 comprises a load/store unit (LD/ST) that drives the data-path connections (DPC) connecting the processing element with other processing element(s). Preferably, a different load/store unit (LD/ST) is used in association with the data-path connections (DPC) connecting the processing element directly to other processing elements. This way, if the processing element is directly connected to e.g. four other processing elements, then four different load/store units are preferably used for communication with those processing elements, not shown in FIG. 3. In addition, further load/store (LD/ST) units could be added to the data-path of a processing element, and associated to data memories (e.g. RAM), either local to the processing element, or system-level memories, not shown in FIG. 3. In an alternative embodiment, a second interconnect network could be used in between register file and operation issue slots. The functional unit(s) in an issue slot have access to at least one register file associated with said issue slot. In FIG. 3, there is one register file associated with issue slot IS6, and another register file associated with issue slot IS7. Alternatively, independent register files are connected to the issue slot, e.g. one different RF for each separate read port of a functional unit in the issue slot. The functional units are controlled by a controller CT that has access to an instruction memory IM. A program counter PC determines the current instruction address in the instruction memory IM. The instruction pointed to by said current address is first loaded into an internal instruction register IR in the controller. The controller CT then controls data-path elements (functional units, register files, interconnect network) to perform the operations specified by the instruction stored in the instruction register IR. To do so, the controller communicates to the functional units via an opcode-bus OB, e.g. providing operation codes to the functional units, to the register files via an address-bus AB, e.g. providing addresses for reading and writing registers in the register file, and to the interconnect network IN through a routing-bus RB, e.g. providing routing information to the interconnect multiplexers. Processing elements of the first set have a relatively lower number of issue slots and are therefore suitable for computing inherently sequential functions, for example Huffman coding.

FIG. 4 shows an example of the data-path connection DPC between processing elements in more detail. In a preferred embodiment, the data-path connections use a data-driven synchronization mechanism, in order to prevent data from being lost during communication between processing elements. The data-path connection between processing elements PE2 and PE4, shown in FIG. 4, comprises two blocking First-In-First-Out (FIFO) buffers BF. The FIFO buffers BF are controlled by control signals hold_w and hold_r. In case processing element PE2 or PE4 is trying to write data to a FIFO buffer BF that is full, the signal hold_w is activated, halting the entire processing element until another processing element reads at least one data element from that FIFO buffer, freeing up storage space in that FIFO buffer. At that moment the hold_w signal is deactivated. A clock-gating mechanism can be used to halt the processing element from writing data to a full FIFO buffer, using the hold_w signal, as long as that FIFO buffer is full. In case a processing element PE2 or PE4 tries to read a value from a FIFO buffer that is empty, the hold_r signal is activated, halting the entire processing element until another processing element writes at least one data element into the FIFO buffer. At that moment the hold_r signal is deactivated and the processing element that was halted can start reading data from said FIFO buffer again. A clock-gating mechanism can be used to halt a processing element from reading data from an empty FIFO buffer, using the hold_r signal, as long as that FIFO buffer is empty.
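The blocking behaviour of a FIFO buffer BF can be sketched in software as a cycle-by-cycle simulation. This is a minimal model and not the hardware of FIG. 4: the class `BlockingFifo`, the function `simulate`, the buffer capacity, and the fixed consumer period are assumptions introduced for illustration, and the stall of a whole processing element by clock gating is reduced to a counter. The names `hold_w` and `hold_r` follow the control signals in the text.

```python
from collections import deque


class BlockingFifo:
    """Bounded FIFO standing in for one buffer BF (capacity is an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def hold_w(self):
        # Active while the buffer is full: the writing element must halt.
        return len(self.items) >= self.capacity

    def hold_r(self):
        # Active while the buffer is empty: the reading element must halt.
        return len(self.items) == 0


def simulate(values, capacity, consumer_period, max_cycles=10_000):
    """Stream `values` from a producer to a consumer through one FIFO.

    The producer attempts one write per cycle; the consumer attempts one
    read every `consumer_period` cycles. Returns the received values and
    the number of cycles each side spent halted.
    """
    fifo = BlockingFifo(capacity)
    pending = deque(values)
    received = []
    writer_stalls = 0
    reader_stalls = 0
    for cycle in range(max_cycles):
        if len(received) == len(values):
            break
        # Consumer side: halted by hold_r when the buffer is empty.
        if cycle % consumer_period == 0:
            if fifo.hold_r():
                reader_stalls += 1
            else:
                received.append(fifo.items.popleft())
        # Producer side: halted by hold_w when the buffer is full.
        if pending:
            if fifo.hold_w():
                writer_stalls += 1
            else:
                fifo.items.append(pending.popleft())
    return received, writer_stalls, reader_stalls


received, writer_stalls, reader_stalls = simulate(list(range(8)), capacity=2, consumer_period=3)
```

Because neither side proceeds while its hold signal is active, every value written is eventually read, in order, regardless of how the production and consumption rates differ. A slower consumer only increases the number of cycles the producer spends halted.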

In a preferred embodiment, processing elements in both sets are VLIW processors, wherein processing elements of the second set are wide VLIW processors, i.e. VLIW processors with many issue slots, while processing elements of the first set are narrow VLIW processors, i.e. VLIW processors with a small number of issue slots. In an alternative embodiment, processing elements of the second set are wide VLIW processors with many issue slots, and processing elements of the first set are single-issue slot Reduced Instruction Set Computer (RISC) processors. A wide VLIW processor with many issue slots allows exploiting instruction-level parallelism in a thread running on that processor, while a single-issue slot RISC processor, or a narrow VLIW processor with few issue slots, can be designed to efficiently execute a series of instructions sequentially. In practice, an application often comprises a series of threads that can be executed in parallel, where some threads are very poor in instruction-level parallelism, and some threads inherently have a large degree of instruction-level parallelism. During compilation of such an application, the application is analyzed and different threads that can be executed in parallel are identified. Furthermore, the degree of instruction-level parallelism within a thread is determined as well. This application can be mapped onto a processing system according to the invention as follows. Threads that have a large degree of instruction-level parallelism are mapped onto the wide VLIW processors, while threads that are very poor in instruction-level parallelism, or have no instruction-level parallelism at all, are mapped onto the single-issue slot RISC processors, or the narrow VLIW processors. Communication between the different threads is mapped onto the data-path connections DPC, as shown in FIG. 1. As a result an efficient execution of the application is allowed: multiple threads are executed in parallel, while simultaneously instruction-level parallelism within a thread can be exploited. Therefore, a processing system according to the invention can exploit both instruction-level parallelism and thread-level parallelism present in an application. In addition, the present invention has the advantage of allowing for a proper match between the computational characteristics of a thread, and those of the processing element it is mapped onto. This way, an inherently sequential function like Huffman decoding is not mapped onto a wide VLIW processor, wasting architecture resources that go unused due to the lack of instruction-level parallelism, but is mapped instead onto a small RISC processor that fits its computational patterns, the wide VLIW processor remaining available for other functions.
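The mapping rule described above can be sketched as a simple assignment procedure. The sketch is hypothetical and does not reproduce any particular compiler: the `Thread` record, the numeric `ilp` estimate (average number of independent operations per cycle), the threshold of 2.0, and the function name `assign_threads` are assumptions chosen for illustration. Threads whose estimate reaches the threshold are placed on wide processing elements, and all others on narrow ones.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    ilp: float  # assumed metric: average independent operations per cycle


def assign_threads(threads, wide_pes, narrow_pes, ilp_threshold=2.0):
    """Map each thread onto a free processing element that matches its ILP.

    wide_pes / narrow_pes: names of free processing elements of the second
    and first set respectively. Raises ValueError when a pool is exhausted.
    """
    wide_free = list(wide_pes)
    narrow_free = list(narrow_pes)
    mapping = {}
    for thread in threads:
        # High-ILP threads go to wide elements, the rest to narrow elements.
        pool = wide_free if thread.ilp >= ilp_threshold else narrow_free
        if not pool:
            raise ValueError(f"no free processing element for thread {thread.name}")
        mapping[thread.name] = pool.pop(0)
    return mapping


# The ILP figures below are invented purely to exercise the rule.
threads = [Thread("fft", 6.0), Thread("huffman_decode", 1.0)]
mapping = assign_threads(threads, wide_pes=["PE17", "PE19"], narrow_pes=["PE1", "PE3"])
```

With these invented figures the FFT thread lands on a wide element and the Huffman decoding thread on a narrow one, leaving the remaining wide element free for other functions. A fuller treatment would also weigh the position of each element in the network, since placement affects how many elements a data stream must by-pass.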

FIG. 5 shows the application graph of an application that has to be executed by a processing system shown in FIG. 1. Referring to FIG. 5, the application comprises five threads TA, TB, TC, TD and TE. These five threads can be executed in parallel. Threads TA, TB, TC and TE have a large degree of instruction-level parallelism, while thread TD has no instruction-level parallelism. The threads exchange data via data streams DS, and these data streams are buffered by data buffers DB. When mapping the application onto the processing system, the threads TA, TB, TC and TE are each mapped onto one of the processing elements PE17-PE23, and thread TD is mapped onto one of the processing elements PE1-PE15. One alternative is to map thread TA onto processing element PE17, thread TB onto processing element PE19, thread TC onto processing element PE21, thread TD onto processing element PE15 and thread TE onto processing element PE23. In this case the threads TC, TD and TE are mapped onto processing elements that are directly connected via data-path connections DPC, i.e. processing element PE21 directly communicates with processing element PE15, and processing element PE15 directly communicates with processing element PE23. For threads TA and TB this is not the case, since processing element PE17 has to communicate with processing elements PE19 and PE21, which are indirectly coupled to PE17 via PE7 and PE9, respectively. Likewise, processing element PE19 has to communicate with processing element PE23, which is indirectly coupled to PE19 via PE11. In these cases the processing elements PE7, PE9 and PE11 can be by-passed in order to allow a direct communication between the processing elements. The data streams DS are mapped onto data-path connections DPC, and the data buffers DB are mapped onto a FIFO buffer BF, as shown in FIG. 4. In different embodiments, the application graph may comprise more or fewer threads, as well as a different ratio between threads having a large degree of instruction-level parallelism and a low degree of instruction-level parallelism.
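The example mapping above can be checked with a short route search. The sketch below is an illustration under stated assumptions: `LINKS` lists only the data-path connections that the text names (a fragment of FIG. 1, not the full network), `STREAMS` lists thread pairs inferred from which processing elements the text says must communicate, and the helper names are hypothetical. For each data stream it finds the shortest chain of connections and reports the intermediate processing elements that would have to be by-passed.

```python
from collections import deque

# Direct data-path connections named in the text (fragment of the network).
LINKS = [
    ("PE17", "PE7"), ("PE7", "PE19"),
    ("PE17", "PE9"), ("PE9", "PE21"),
    ("PE19", "PE11"), ("PE11", "PE23"),
    ("PE21", "PE15"), ("PE15", "PE23"),
]
# Thread placement given in the text.
MAPPING = {"TA": "PE17", "TB": "PE19", "TC": "PE21", "TD": "PE15", "TE": "PE23"}
# Communicating thread pairs, inferred from the text (direction not modelled).
STREAMS = [("TA", "TB"), ("TA", "TC"), ("TB", "TE"), ("TC", "TD"), ("TD", "TE")]


def build_adjacency(links):
    """Return an undirected adjacency map for the given connections."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_route(adjacency, source, target):
    """Breadth-first search; returns the list of elements on a shortest route."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            route = []
            while node is not None:
                route.append(node)
                node = previous[node]
            return route[::-1]
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def bypassed_elements(links, mapping, streams):
    """For each stream, list the intermediate elements on its shortest route."""
    adjacency = build_adjacency(links)
    result = {}
    for producer, consumer in streams:
        route = shortest_route(adjacency, mapping[producer], mapping[consumer])
        if route is None:
            raise ValueError(f"no route between {producer} and {consumer}")
        result[(producer, consumer)] = route[1:-1]
    return result


bypasses = bypassed_elements(LINKS, MAPPING, STREAMS)
```

For this fragment the search reproduces the situation described in the text: the streams between TC, TD and TE need no by-pass, while the streams from TA to TB, from TA to TC and from TB to TE each pass through exactly one unused processing element (PE7, PE9 and PE11 respectively).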

In a preferred embodiment, as shown in FIG. 1, the processing elements of the first and the second set are interleaved, i.e. a processing element of the first set is arranged for direct communication with a processing element of only the second set, and a processing element of the second set is arranged for direct communication with a processing element of only the first set. As a result, there is never a penalty of more than one by-passed processing element for communication between two threads executing on different processing elements.

The degree of instruction-level parallelism and thread-level parallelism that can be exploited will vary from one application to the other, ranging from applications having a low degree of thread-level parallelism wherein each thread has a high degree of instruction-level parallelism, to applications having a large degree of thread-level parallelism wherein each thread has no instruction-level parallelism. The flexibility of a processing system as shown in FIG. 1 allows the whole range of applications to be mapped onto the processing system, by by-passing processing elements onto which no thread can be mapped.

Referring to FIG. 2, the interconnect network IN is a fully connected network, i.e. all functional units are coupled to all register files RF1, RF2, RF3, RF4 and RF5. Alternatively, the interconnect network IN can be a partially connected network, i.e. not every functional unit is coupled to all register files. In case of a large number of functional units, the overhead of a fully connected network will be considerable in terms of silicon area and power consumption. During design of the VLIW processor it is decided to which degree the functional units are coupled to the register file segments, depending on the range of applications that has to be executed by the processing system.

Referring again to FIG. 2, the processing element comprises a distributed register file, i.e. register files RF1, RF2, RF3, RF4 and RF5. Alternatively, the processing element may comprise a single register file for all functional units. In case the number of functional units of a VLIW processor is relatively small, the overhead of a single register file is relatively small as well.

In an alternative embodiment, the processing elements of the second set comprise a superscalar processor. A superscalar processor also comprises multiple execution units that can perform multiple operations in parallel, as in the case of a VLIW processor. However, the processor hardware itself determines at runtime which operation dependencies exist and decides which operations to execute in parallel based on these dependencies, while ensuring that no resource conflicts will occur. The principles of the embodiments for a VLIW processor, described in this section, also apply to a superscalar processor. In general, a VLIW processor may have more execution units in comparison to a superscalar processor. The hardware of a VLIW processor is less complicated in comparison to a superscalar processor, which results in a more scalable architecture. The number of execution units and the complexity of each execution unit, among other things, will determine the amount of benefit that can be reached using the present invention.

In other embodiments of a processing system according to the invention, the processing system may comprise more or fewer processing elements than the processing system shown in FIG. 1. Alternatively, the processing elements may be arranged differently, for example in a one-dimensional network, or not in an interleaved fashion, i.e. between two processing elements of the first set more than one processing element of the second set is located, and vice versa. The architecture of the processing system may depend on the range of applications that is expected to be executed on the processing system, for example the amount of thread-level parallelism that range of applications has relative to the amount of instruction-level parallelism.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

1. A processing system comprising a plurality of processing elements,the plurality of processing elements comprising a first set ofprocessing elements and at least a second set of processing elements,wherein each processing element of the first set comprises a registerfile and at least one instruction issue slot, the instruction issue slotcomprising at least one functional unit, and the processing elementbeing arranged to execute instructions under a common thread of control,wherein each processing element of the second set comprises a registerfile and a plurality of instruction issue slots, each instruction issueslot comprising at least one functional unit, and the processing elementbeing arranged to execute instructions under a common thread of control,and wherein the number of instruction issue slots in the processingelements of the second set is substantially higher than the number ofinstruction issue slots in the processing elements of the first set, andwherein the processing system further comprises inter-processorcommunication means arranged for communicating between processingelements of the plurality of processing elements.
2. A processing system according to claim 1, characterized in that the processing elements of the plurality of processing elements are arranged in a network, wherein a processing element of the first set is arranged for direct communication with a processing element of only the second set, via the inter-processor communication means, and wherein a processing element of the second set is arranged for direct communication with a processing element of only the first set, via the inter-processor communication means.
3. A processing system according to claim 1, characterized in that the plurality of issue slots organized in a processing element of the second set of processing elements share at least one common control signal for controlling instruction execution.
4. A processing system according to claim 1, characterized in that the processing elements of the first set of processing elements are arranged for issuing only one operation per cycle.
5. A processing system according to claim 1, characterized in that the processing elements of the second set of processing elements are Very Long Instruction Word processors, wherein the register file is accessible for said processing elements by the corresponding functional units and wherein the processing elements further comprise a local communication network for coupling the register file and the corresponding functional units.
6. A processing system according to claim 1, characterized in that the processing elements of the first set of processing elements are Very Long Instruction Word processors, wherein the register file is accessible for said processing elements by the corresponding functional units and wherein the processing elements further comprise a local communication network for coupling the register file and the corresponding functional units.
7. A processing system according to claim 5, characterized in that the register file corresponding to a processing element is a distributed register file.
8. A processing system according to claim 5, characterized in that the local communication network corresponding to a processing element is a partially connected communication network.
9. A processing system according to claim 1, characterized in that the inter-processor communication means comprise a data-driven synchronized communication means.
10. A processing system according to claim 9, characterized in that the data-driven synchronized communication means comprise a blocking First-In-First-Out buffer.
11. A processing system according to claim 1, characterized in that the processing elements of the plurality of processing elements are arranged to be bypassed by the inter-processor communication means.
12. A method for programming a processing system, wherein the processing system comprises a plurality of processing elements, the plurality of processing elements comprising a first set of processing elements and at least a second set of processing elements, wherein each processing element of the first set comprises a register file and at least one instruction issue slot, the instruction issue slot comprising at least one functional unit, and the processing element being arranged to execute instructions under a common thread of control, wherein each processing element of the second set comprises a register file and a plurality of instruction issue slots, each instruction issue slot comprising at least one functional unit, and the processing element being arranged to execute instructions under a common thread of control, and wherein the number of instruction issue slots in the processing elements of the second set is substantially higher than the number of instruction issue slots in the processing elements of the first set, and wherein the processing system further comprises inter-processor communication means arranged for communicating between processing elements of the plurality of processing elements, and wherein the method of programming the processing system comprises the following steps: identifying a first set of functions in an application graph wherein each function inherently contains instructions to be executed mainly sequentially, identifying a second set of functions in an application graph wherein each function inherently contains instruction-level parallelism, mapping the first set of functions onto processing elements of the first set of processing elements, mapping the second set of functions onto processing elements of the second set of processing elements.
13. A method for programming a processing system according to claim 12, characterized in that the method further comprises the step of: bypassing a processing element of the plurality of processing elements by the inter-processor communication means.
14. A compiler program product being arranged for implementing all steps of the method for programming a processing system according to claim 12, when said compiler program product is run on a computer system.
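
By way of illustration only, the programming method of claim 12 may be sketched in Python as follows. The sketch is not part of the claimed subject matter and rests on several assumptions that do not appear in the text above: each function in the application graph carries a hypothetical numeric estimate `ilp` of its instruction-level parallelism, a hypothetical threshold `ILP_THRESHOLD` separates mainly sequential functions from functions containing instruction-level parallelism, and each processing element receives at most one function. All names, including the example functions and processing elements, are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingElement:
    name: str
    issue_slots: int  # number of instruction issue slots in this element


@dataclass(frozen=True)
class Function:
    name: str
    ilp: float  # hypothetical estimate of operations issuable per cycle


# Assumption: functions below this estimate count as "mainly sequential".
ILP_THRESHOLD = 2.0


def partition_functions(functions, threshold=ILP_THRESHOLD):
    """Split the application graph's functions into the two claimed sets."""
    sequential = [f for f in functions if f.ilp < threshold]
    parallel = [f for f in functions if f.ilp >= threshold]
    return sequential, parallel


def map_functions(functions, first_set, second_set, threshold=ILP_THRESHOLD):
    """Map sequential functions onto first-set elements and functions with
    instruction-level parallelism onto second-set elements.

    Simplifying assumption: one function per processing element.
    """
    sequential, parallel = partition_functions(functions, threshold)
    if len(sequential) > len(first_set) or len(parallel) > len(second_set):
        raise ValueError("not enough processing elements of the required type")
    mapping = {}
    for function, element in zip(sequential, first_set):
        mapping[function.name] = element.name
    for function, element in zip(parallel, second_set):
        mapping[function.name] = element.name
    return mapping


# Hypothetical system: two narrow elements and one wide element.
first_set = [ProcessingElement("pe_small_0", 1), ProcessingElement("pe_small_1", 1)]
second_set = [ProcessingElement("pe_wide_0", 5)]

# Hypothetical application graph (edges omitted; only the nodes matter here).
graph = [
    Function("servo_control", 1.1),
    Function("fir_filter", 4.0),
    Function("parser", 1.0),
]

mapping = map_functions(graph, first_set, second_set)
print(mapping)
```

In this sketch the two control-oriented functions are assigned to the elements having a single issue slot, and the filter function is assigned to the element having five issue slots. A practical compiler would presumably derive the parallelism estimate from the code itself and would also take the inter-processor communication means into account when placing functions.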